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Abstract 

Background: Advances in whole genome profiling have revolutionized the cancer research field, but at the same 
time have raised new bioinformatics challenges. For next generation sequencing (NGS), these include data storage, 
computational costs, sequence processing and alignment, delineating appropriate statistical measures, and data 
visualization. Currently there is a lack of workflows for efficient analysis of large, MethylCap-seq datasets containing 
multiple sample groups. 

Methods: The NGS application MethylCap-seq involves the in vitro capture of methylated DNA and subsequent 
analysis of enriched fragments by massively parallel sequencing. The workflow we describe performs MethylCap- 
seq experimental Quality Control (QC), sequence file processing and alignment, differential methylation analysis of 
multiple biological groups, hierarchical clustering, assessment of genome-wide methylation patterns, and 
preparation of files for data visualization. 

Results: Here, we present a scalable, flexible workflow for MethylCap-seq QC, secondary data analysis, tertiary 
analysis of multiple experimental groups, and data visualization. We demonstrate the experimental QC procedure 
with results from a large ovarian cancer study dataset and propose parameters which can identify problematic 
experiments. Promoter methylation profiling and hierarchical clustering analyses are demonstrated for four groups 
of acute myeloid leukemia (AML) patients. We propose a Global Methylation Indicator (GMI) function to assess 
genome-wide changes in methylation patterns between experimental groups. We also show how the workflow 
facilitates data visualization in a web browser with the application Anno-J. 

Conclusions: This workflow and its suite of features will assist biologists in conducting methylation profiling 
projects and facilitate meaningful biological interpretation. 



Background 

Advances in whole genome profiling technologies have 
revolutionized the field of cancer research. These technol- 
ogies have facilitated the discovery of potential biomarkers 
for disease development and progression as well as our 
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understanding of the complex, underlying molecular 
mechanisms that lead to cancer. Reduction in costs have 
spurred the adoption of next generation sequencing 
(NGS) platforms which offer greater resolution and sensi- 
tivity compared to traditional microarray profiling [1]. At 
the same time, NGS raises new bioinformatics challenges, 
both practical (e.g. data storage, computational costs) and 
theoretical (e.g. defining appropriate statistical measures). 

A promising application of NGS is the whole-genome 
profiling of epigenetic modifications, including DNA 
methylation. The addition of methyl groups to the 5' 
carbon position of cytosine bases is a major mechanism 
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of epigenetic regulation which participates in reorganiz- 
ing chromatin structure and silencing gene expression 
[2], Epigenetic alterations, such as tumor suppressor 
gene hypermethylation and oncogene hypomethylation, 
are hallmarks of cancer and play a pivotal role in tumor- 
genesis and disease progression [3,4]. 

The DNA methylation profiling approach used in our 
lab, MethylCap-seq involves the in vitro capture of methy- 
lated DNA with the high affinity methyl-CpG binding 
domain of human MBD2 protein and subsequent analysis 
of enriched fragments by massively parallel sequencing 
[5-8]. Benchmarking has shown MethylCap-seq is more 
effective at interrogating CpG islands than antibody-based 
methylated DNA immunoprecipitation sequencing 
(MeDIP-seq) [9]. While optimizing this experimental tech- 
nique, we recognized two potential issues affecting subse- 
quent data analysis. First, unsuccessful or incomplete 
capture reactions can result in the sequencing of non- 
methylated DNA fragments, leading to inconsistencies in 
or the absence of methylation enrichment in a sample. 
Second, poor sequencing library complexity and CpG cov- 
erage limit the statistical power to call differential methyla- 
tion, and ultimately the reproducibility of the dataset. 
Conventional sequencing analysis pipelines often do not 
include assay-dependent quality control assessments. 
Spurious samples reduce analytical power and lead to 
excess "noise" in downstream analyses. 

The challenges to data analysis are real. The numerous 
options for file processing and genome alignment mean 
any particular strategy requires extensive troubleshooting 
and optimization. Large file sizes make data visualization 
exceedingly difficult without the use of expensive com- 
mercial software packages or system resource-intensive 
publicly available programs. In more practical terms, 
MethylCap-seq projects, in particular, would greatly ben- 
efit from the ability to receive rapid feedback of overall 
experimental quality. There is also a lack of workflows 
for efficient analysis of large, MethylCap-seq datasets 
containing multiple sample groups. To address these per- 
tinent issues, we have developed a scalable, flexible work- 
flow for MethylCap-seq Quality Control and secondary 
data analysis which facilitates tertiary analysis of multiple 
experimental groups and data visualization. 

The automated MethylCap-seq workflow has been 
developed over the course of 200 sequencing runs. The 
workflow is scalable in terms of handling studies of dispa- 
rate sample sizes. It is flexible in that unique experimental 
considerations (genome alignment, read bin sizes, test sta- 
tistics) can be addressed by simple modification of several 
operational parameters independent of the scripts respon- 
sible for automating the workflow. Automation is impera- 
tive because of the large number of intermediate steps and 
temporary files required. The workflow incorporates pro- 
ven, existing tools where applicable: e.g., raw read 



processing, the short read aligner, the R environment and 
third party libraries. It further takes advantage of high per- 
formance computing systems for parallel batch job sub- 
missions. This feature is important for scalability and 
computational feasibility. Data visualization is supported 
by Anno-J, a genome annotation visualization program 
and web service viewport. 

Methods 

Patient samples 

A total of 71 ovarian cancer samples, 6 adjacent normal 
tissues as well as 20 tissues collected from patients during 
surgery for benign gynaecological disease were obtained 
from Triservice General Hospital, Taipei, Taiwan. All 
studies involving human ovarian cancer samples were 
approved by the Institutional Review Boards of Triservice 
General Hospital and National Defense Medical Center. 

Methylated-DNA capture (MethylCap-seq) 

Enrichment of methylated DNA was performed with the 
Methylminer kit (Invitrogen) according to the manufac- 
turer's protocol. Briefly, one microgram of sonicated 
DNA was incubated at room temperature on a rotator 
mixer in a solution containing 3.5 micrograms of MBD- 
Biotin Protein coupled to M-280 Streptavidin Dynabeads. 
Non-captured DNA was removed by collecting beads on 
a magnet and washing three times with Bind/Wash Buf- 
fer. Enriched, methylated DNA was eluted from the bead 
complex with 1 M NaCl and purified by ethanol precipi- 
tation. Library generation and 36-bp single-ended 
sequencing were performed on the Illumina Genome 
Analyzer IIx according to the manufacturer's standard 
protocol. Each sample was sequenced on a single lane, 
for a total of 97 lanes. Additional data sets are presented 
for demonstration purposes only. 

MethylCap-seq experimental QC 

The quality control module identifies technical problems 
in the sequencing data and flags potentially spurious 
samples. The module is based on MEDIPs [10], an 
enrichment-based DNA methylation analysis package, 
and provides rapid feedback to investigators regarding 
dataset quality, facilitating protocol optimization prior to 
committing resources to a larger scale sequencing pro- 
ject. Figure 1 illustrates the QC automated workflow. For 
each aligned sequencing file (e.g., the default output of 
Illumina's CASAVA pipeline), duplicate reads are 
removed (a correction for potential PCR artifacts), and a 
stripped, uniquely aligned sequence BED file is loaded 
into an R workspace for processing by the MEDIPS 
library. Three functions are performed on the data: 
Saturation analysis, CpG enrichment calculation, and 
CpG coverage analysis. Saturation analysis performs a 
Pearson correlation coefficient estimation of sequencing 
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Figure 1 Diagram of MethylCap-seq QC workflow Arrows 
indicate order of operation. Green ovals, Data; Blue Rectangles, 
Process. 



library complexity and potential reproducibility. CpG 
enrichment calculation consists of the relative CpG dinu- 
cleotide frequency interrogated by aligned sequence 
reads divided by the relative frequency of CpG dinucleo- 
tides in the reference genome. CpG coverage rate (5x) is 
the fraction of CpG dinucleotides in the reference gen- 
ome sequenced at least five times. The automated work- 
flow produces a QC summary file containing the 
MEDIPs results and sequencing output metrics from the 
Illumina CASAVA pipeline. The QC module utilizes the 
parallel processing capability of a supercomputing envir- 
onment to greatly decrease the time required for analysis. 

Sequence file processing and alignment 

The ability to use multiple custom sequence alignment 
policies facilitates analysis of various genomic regions 
and features. Bowtie, a short read aligner, provides 
many alignment policies and options that allow a great 
deal of customization of the alignment output[ll]. 
While our focus and workflow centers on reporting 
uniquely aligned reads, alternative alignment options are 
used for more customized data analysis. The qseq files 
are preprocessed for a uniquely aligned Bowtie output 
by being converted to FASTA format. The converted 
file is then aligned by Bowtie with options that optimize 
for uniquely aligned reads and output in SAM format. 
Post processing uses various SAMtools [12] commands 
to convert the alignment to BAM format and remove all 
duplicate reads from the alignment before converting 
back to a final SAM alignment. The workflow, illu- 
strated in Figure 2, is concisely handled by a single 



script which passes each intermediate stage of the align- 
ment process to the subsequent stage and outputs a sin- 
gle SAM alignment file and a report of the number of 
reads that were aligned and those which were counted 
as duplicates. Speed is increased by Bowtie's multi- 
threading options and by performing the alignment in a 
supercomputing environment. To achieve alternative 
alignments, Bowtie options can be changed, and differ- 
ent genomic sequences or subsets of genomic sequences 
may be used for alignment. With minor modification 
our workflow can be run with other short read aligners 
that generate SAM files as output. 

Global methylation analysis workflow 

The methylation analysis workflow is outlined in Figure 
2. Chromosomal coordinates of sequence reads are 
parsed from the final alignment output, then counted 
using a specified bin size and read extension length 
(reflecting average fragment size) in order to generate a 
binary file containing bin counts and scaled count values 
(reads per million - rpms). The bin size determines the 
computational resolution of the analysis. We find that a 
bin size of 500 bp provides sufficient analysis resolution 
while smoothing the data statistically. The binary counts 
file is next interrogated by genomic feature (e.g., CpG 
islands, CpG shores, Refseq genes) to generate feature- 
specific count files. The workflow is compatible with 
custom feature files listing genomic loci of interest in 
BED format. In addition, aggregate read count summa- 
ries can be compiled for each type of genomic feature. 
Our approach of binning aligned reads, scaling read 
count values, and/or generating genomic feature-specific 
count files could prove applicable to other enrichment- 
based sequencing methods. For instance, the process 
responsible for filtering counts by genomic features 
might be modified to accept ChlP-seq peak calling 
values. 

Once the samples are binned and genomic features 
are extracted, they are grouped based on biological fac- 
tors, such as known genotype difference, and statistical 
tests are performed to discern if there are significant dif- 
ferences in methylation counts among predefined 
groups of samples. One locus from a genomic feature in 
one group is tested against the same locus in the other 
group for all loci in that genomic feature. The statistical 
test used is dependent on the number of groupings. For 
two groups a Wilcoxon rank-sum test is employed to 
test the distribution of methylation counts for each 
locus across the two groups. We then select significant 
differentially methylated loci by applying a multiple test 
corrected p-value cutoff (False Discovery Rate, FDR). 
Similarly for groupings of more than two biological fac- 
tors, the Kruskal-Wallis test is employed. Statistical test- 
ing of genomic features is a custom workflow 
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Figure 2 Diagram of global methylation analysis workflow. Arrows indicate order of operation and break at decision trees. Green ovals, 
Data; Blue Rectangles, Process. 



implemented in R which utilizes the predefined Wil- 
coxon and Kruskal-Wallis test functions. The output of 
the workflow is a list of loci from each genomic feature 
that passes significance testing. Boxplots are also created 
for the list of significant features for visualization of 
their differential methylation. 

To assess genome-wide changes in methylation patterns 
between experimental groups, we calculate a Global 
Methylation Indicator (GMI) for each individual sample in 
different groups. First, the sample's methylation distribu- 
tion, an average rpm for each CpG content classification, 
is determined. The distribution is obtained as follows: 
each 500 base bin is classified by the CpG content (# of 
CG base sequences, counting any CG base sequences 
straddling the end of the bin and the beginning of the 
next) within the 500 bases it covers. Then within each 
CpG content classification, the average rpm per bin is cal- 
culated by summing the rpms and dividing by the number 
of bins. The difference between two groups is calculated 
by using a t test on the estimation of the area under the 
curve for each individual sample. 

Clustering 

To identify novel classifications of samples indepen- 
dently of predefined biological factors, unsupervised 
clustering of the data can be implemented. Clustering of 



the data is a workflow that takes a data matrix of the 
samples and the rpm value of each locus for a given 
genomic feature. The workflow is implemented in R and 
utilizes various R libraries for matrix manipulation, 
flashClust, and pvclust for unsupervised clustering. 
Adjusted p-values are obtained via multiscale bootstrap 
resampling of the data. Many combinations of correla- 
tion calculations and clustering methods can be imple- 
mented. Our clustering workflow uses the Pearson 
correlation distance measure. It takes as input the "raw" 
rpm data values or rescaled rpms, depending on the fea- 
tures of interest in the dataset. Rescaling the rpms 
involves dividing the rpms of each locus by the average 
rpm for that locus. This allows Pearson correlation to 
evenly weight both the low and high rpm values. Using 
the raw rpms causes Pearson correlation to more heavily 
weight the high rpms. The default clustering method of 
the workflow is that of McQuitty, but R provides any 
number of additional choices. Our workflow also imple- 
ments data selection criteria that enforce a minimum 
coefficient of variation (CV) threshold in combination 
with minimum average rpm threshold for each locus. In 
tandem with the dendrograms, heatmaps are also pro- 
duced to help visualize the relationship between the 
clustering sample members. This entire workflow, 
including all combinations of selection criteria and all 
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genomic features of interest, is completed in a single 
script. 

Because we produce a variety of dendrograms through 
the use of various genomic features and loci selection cri- 
teria, it is useful to see if the membership of a significant 
group is conserved throughout the dendrograms that were 
created using other genomic features and even within 
genomic features analyzed with varying selection criteria. 
To easily visualize the location of a certain sample group's 
membership in other dendrograms, we use different colors 
to track the membership of that group through alternative 
dendrograms that are produced for different genomic fea- 
tures and selection criteria. Tracking the membership of a 
group is accomplished by supplying the membership of 
that group to a color function that can be applied to sub- 
sequent dendrograms through the dendrapply function 
in R. 

Data visualization 

In our workflow, we have incorporated Anno-J, a REST- 
based Web 2.0 application for the visualization of deep 
sequencing information and other genomic annotations 
[13]. Anno-J is capable of streaming all necessary applets 
and scripts to the user, providing immediate and installa- 
tion-less viewing within a user's web browser. This facili- 
tates the fast, real-time and interactive visualization of 
multiple data sets by users with access to any server host- 
ing Anno-J. Data visualization within Anno-J uses tracks, 
discrete rows of graphs, each of which corresponds to a 
particular set of data. Our workflow incorporates a num- 
ber of custom scripts which allow quick conversion of bin- 
ary and raw text read counts and SAM files to various 
Anno-J track formats, including standard mask and read 
tracks. These scripts extract from read count files the loca- 
tion and rpm, and from SAM files the location, sequence 
count and strand identifier, and generate Anno-J read 
track format files. With minor modification, the scripts 
could be used to generate data tracks compatible with the 
UCSC Genome Browser. For the Anno-J experiment 
tracks, a scheduled service loads any new files from a 
shared folder into our database using a prescribed data 
format. Each track is assigned a unique identifier and 
properties for experiment type (e.g., methylation, small 
RNA) and track type (e.g., read, mask). The Anno-J web 
application will configure the browser with specified tracks 
based on these properties. The browser calls web services 
which return formatted data for each track, filtered by the 
currently viewable portion of the chromosome. 

Additionally, we have incorporated a custom algorithm 
which allows conversion of binary and raw text read 
counts files to a custom discretized methylation heatmap 
track format. The heatmap track format modifies con- 
straints and features of the Anno-J mask track format to 
allow generation of individual rows of heatmap data. 



Discretized methylation heatmap tracks are created by 
percentile ranking binned rpm values from binary or raw 
text read counts files, and then assigning color gradient 
based upon rank. Generation of the final discretized heat- 
map is a matter of stacking multiple heatmap tracks 
together. 

Results and discussion 

Experimental quality control 

The automated MethylCap-seq workflow has been devel- 
oped over the course of 200 sequencing runs. It has been 
applied to human solid tumors (e.g., breast, ovarian, endo- 
metrial, and hepatocellular carcinoma) and blood cancers 
(e.g., acute myeloid leukemia, chronic lymphocytic leuke- 
mia) as well a number of mouse cancer models. Though 
untested in that context, our analysis workflow should be 
applicable to other enrichment-based methylation assays 
such as MeDIP-seq studies. 

The QC workflow runs immediately after the sequencing 
experiment has been transferred from the Illumina Real 
Time Analysis (RTA) pipeline. It calls several functions of 
the R package MEDIPS [10] and reads the Illumina RTA 
run summary output. From our QC workflow, we have 
found the following parameters considered collectively can 
flag problematic samples: CpG enrichment, saturation, 
CpG coverage, and alignment rate. Even valid samples 
occasionally fail a single parameter; thus, we typically 
exclude those which fail two or more parameters. QC 
results from a large ovarian cancer study dataset (97 
patient samples) are shown in Figure 3A-D. 2 of 97 lanes 
of data were excluded from further analysis. In a second 
large cancer dataset (105 patients) where the majority of 
samples were sequenced on multiple lanes (207 in total), 
43 (20.8%) qualified for exclusion. Sequencing of new 
libraries generated for 12 samples with prior insufficient 
aligned reads all failed the QC again, demonstrating how 
sample intrinsic factors (such as DNA quality or integrity) 
dramatically impact the quality of MethylCap sequencing 
data. CpG enrichment, the frequency of CpG dinucleotides 
observed in the sequenced sample compared to the 
expected frequency in the reference genome, is likely the 
most significant QC parameter because it can indicate fail- 
ure of the initial methylation-capture step. As MethylCap- 
seq is an enrichment-based approach, identifying failures 
in enrichment is imperative. We observed an average CpG 
enrichment value of 2.70 ± 0.35 in the ovarian cancer data- 
set. In general, enrichment values range from 2 - 3.5 and 
show similar distributions in samples from normal and 
malignant tissues as shown for the ovarian cancer dataset 
(Figure 3C). Enrichment values from input samples (non- 
captured DNA) are less than or near 1. We routinely 
exclude samples with CpG enrichment values less than 
1.4. Such samples usually have low saturation values 
as well (less than 0.5), a measure of the statistical 
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Figure 3 QC analysis of a large sample cohort. The QC values associated with a large ovarian cancer study dataset are presented in box plots 
according to sample type (normal = 6, benign = 20, malignant = 71). In each plot, threshold cutoff values for sample exclusion are indicated by 
hatched horizontal reference lines. A, Box plot of MethylCap-seq unique read alignment rate. The fraction of pass-filter reads which align 
uniquely to the human reference genome. B, Box plot of Saturation analysis, a Pearson correlation coefficient estimation of sequencing library 
complexity and potential reproducibility. C, Box plot of CpG enrichment, the relative CpG dinucleotide frequency interrogated by aligned 
sequence reads divided by the relative frequency of CpG dinucleotides in the reference genome. D, Box plot of CpG coverage rate (5x), the 
fraction of CpG dinucleotides in the reference genome sequenced at least five times. 



reproducibility of the dataset, suggesting that the methyla- 
tion calls would be difficult to reproduce if the library was 
resequenced. 

Differential methylation analysis of multiple sample 
groups 

Current strategies for enrichment-based sequencing dif- 
ferential methylation analysis have been limited to indi- 
vidual pair-wise sample comparisons such as tumor 
versus normal [9] or comparisons of multiple samples 
(in pair-wise fashion) to a common normal reference 
sample [14]. Thus a salient feature of our workflow is 
the ability to compare methylation profiles of multiple 
samples in two or more biological groups. Significance 
testing is performed in R with the non-parametric Wil- 
coxon (two groups) or Kruskal-Wallis (> two groups) 
tests. An example of promoter methylation profiling 



analysis for four AML patient groups is shown in Figure 
4A-B. Results of individual features can be visualized by 
whisker plots as in Figure 4A which shows differential 
methylation of the NR_033205 transcript promoter. The 
workflow performs unsupervised clustering in order to 
identify novel classifications of samples (Figure 4B, 
Additional file 1). In methylation profile clustering ana- 
lysis, data selection criteria are enforced in order to pare 
down the number of loci being used for clustering 
within each genomic feature. The rationale for this 
approach is that it allows the clustering to be performed 
on only the loci with the largest differences among sam- 
ples; the minimum rpm value for each locus removes 
loci that were not pulled down well during sequencing 
and thus are expected to be rather noisy. Hierarchical 
clustering of promoter regions passing threshold criteria 
(avg rpm > 10 and CV > 5) reveals four distinct patient 
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Figure 4 Methylation analysis of multiple sample groups. A, Boxplot of noncoding RNA NR_033202 promoter methylation in four groups of 
AML patients. Multiple-testing corrected non-parametric Kruskal-Wallis analysis of variance p-value is shown. B, Hierarchical clustering 
dendrogramand heatmap of methylation in gene promoters among four groups of AML patients. The feature threshold criteria were avg rpm > 
10 and CV > 5. The full dendrogram generated in R with the pvclust package is included in Additional file 1. 
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groupings (Figure 4B, Additional file 1). Combinations 
of the selection criteria produce many different dendro- 
grams of the data for evaluation and serve as a method 
for exploration of novel differentially methylated loci 
that may contribute to biological factors. To determine 
if membership of a significant group is conserved 
among dendrograms created using other genomic 



features or within genomic features analyzed with vary- 
ing selection criteria, we implement a group tracking 
function as shown in Figure 5. If the membership of a 
group is conserved as we track it through alternative 
dendrograms, it is more likely to be biologically signifi- 
cant rather than an artifact of the specific clustering 
procedure. 
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Figure 5 Tracking groups in hierarchical clustering. The top panel shows the initial clustering dendrogram of methylation in gene promoters 
among four groups of AML patients (feature threshold criteria: average rpm > 0, CV > 3). Membership in the two largest clusters is then tracked 
in subsequent dendrograms by blue (left cluster) and red (right cluster) sample labels. Black labels indicate samples not present in the two 
largest clusters. Middle panel, clustering dendrogram of CpG island methylation (feature threshold criteria: average rpm > 0, CV > 3). Bottom 
panel, clustering dendrogram of methylation in CpG island shores (feature threshold criteria: average rpm > 0, CV > 3). 
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Assessing genome-wide methylation patterns 

To assess genome-wide changes in methylation patterns 
between experimental groups independent of genomic fea- 
tures, we calculate a Global Methylation Indicator (GMI) 
for each individual sample in different groups. Comparison 



of two or more Methylation Distributions provides infor- 
mation regarding the differences in average methylation at 
various CpG classifications. A representative Methylation 
Distribution plot presented in Figure 6A shows the distri- 
butions of a normal ovarian tissue, an ovarian carcinoma, 
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Figure 6 Assessment of global DNA methylation patterns. A, Methylation Distribution plot of a normal ovarian tissue (blue circles), an ovarian 
carcinoma (red circles), and an in vitro methylated (Sssi) positive control sample. B, Global Methylation Indicator (GMI) values of the three samples. 
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and an in vitro methylated positive control. Likewise, com- 
paring two or more GMIs may provide information 
regarding gross differences in global methylation. A GMI 
plot corresponding to the three samples in Figure 6A is 
provided in Figure 6B. 

MethylCap-seq data visualization 

Effective data visualization can bridge the divide 
between computational and experimental biologists 
engaged in integrated analysis projects. Visual interpre- 
tation of patterns may permit the researcher to observe 
phenomena which computational analysis do not detect. 
The data workflow prepares samples for visualization in 
a web browser with the application Anno-J (Figure 7). 
In Anno-J, samples are represented as individual data 
tracks which can traversed, scaled and rearranged inter- 
actively by the user in real-time. Individual sequence 
reads can be visualized at single-base resolution as 



demonstrated in the top panel of Figure 7 which depicts 
methylation read data at the EPPK1 gene locus in eight 
AML patient samples. To interact with data at a much 
broader resolution, we developed a custom methylation 
heatmap data track. The bottom panel of Figure 7 
shows a methylation heatmap of the HOXA gene cluster 
in breast cancer cells (n = 35) and normal breast epithe- 
lial cell lines (n = 5). 

Conclusions 

In this paper, we presented a scalable, flexible workflow 
for performing MethylCap-seq Quality Control, secondary 
data analysis, tertiary analysis of multiple experimental 
groups, and data visualization in the web service viewport, 
Anno-J. As the cancer epigenetics field further expands 
into next generation sequencing, our workflow should 
assist biologists in conducting methylation profiling pro- 
jects and facilitate meaningful biological interpretation. 
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Figure 7 Data visualization with Anno-J. Top, Methylation read data at single base resolution. Data depicted are the 5' end of the EPPK1 gene 
(track 1) and associated CpG island (track 2) in eight AML patient samples (tracks 3-10). Bottom, methylation heatmap of the HOXA gene cluster 
in 35 breast cancer cell and five normal breast epithelial cell lines (last five rows). 
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Additional material 



Additional file 1: Promoter methylation hierarchical clustering 
dendrogram 
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